Underwater Distortion Target Recognition Network (UDTRNet) via Enhanced Image Features

It is difficult for an autonomous underwater vehicle (AUV) to recognize targets that resemble their environment when data labels are lacking. Moreover, the complex underwater environment and the refraction of light prevent the AUV from extracting the complete significant features of the target. To address these problems, this paper proposes an underwater distortion target recognition network (UDTRNet) that enhances image features. Firstly, this paper extracts the significant features of the image by minimizing the info noise contrastive estimation (InfoNCE) loss. Secondly, this paper constructs the dynamic correlation matrix to capture the spatial semantic relationship of the target and uses the matrix to extract spatial semantic features. Finally, this paper fuses the significant features and spatial semantic features of the target and trains the target recognition model through cross-entropy loss. The experimental results show that, in recognizing underwater blurred images, the mean average precision (mAP) of the proposed algorithm is 1.52% higher than that of SA-FPN.


Introduction
Underwater target recognition has difficulties in sample data collection and labeling, making it difficult to obtain labeled sample datasets. Unsupervised representational learning can extract significant features of images from unlabeled datasets and use them for target classification and detection tasks.
This method can effectively improve the accuracy of underwater target recognition when labels are insufficient. Moreover, unsupervised representational learning ignores some details of the image and learns only distinguishable features, which can also improve the recognition speed of the algorithm.
However, the scattering and refraction of light in the underwater environment cause the target to be blurred and distorted in the images taken by the AUV. Shoals of fish, currents, and complex underwater terrain can obscure the target. In this case, unsupervised representational learning is unable to extract the complete significant features for target recognition. The graph structure establishes the topology of correlation between nodes through vertices and edges and contains rich spatial semantic information. Semantic relationship graphs can compensate for incomplete significant features.
Graph convolutional networks (GCNs) can extract features of graph structures and gain the spatial semantic relations of targets effectively. However, the spatial semantic relationship graphs of the targets are usually static graphs obtained by computing the label co-occurrence relationships in the whole dataset. In the underwater environment, the number of sample data from different classes in the dataset is unevenly distributed. In this case, a static spatial semantic relationship graph will reduce the generality of the model [1]. Constructing a dynamic spatial semantic relationship graph can improve the robustness of the algorithm.
To address the above problems, this paper proposes an underwater distortion target recognition network (UDTRNet) via enhanced image features. The method allows fast recognition of underwater distortion targets in the absence of significant features. The following are the main contributions in the methodology of this paper:
(1) In this paper, the original sample data are compared with positive and negative samples in the feature space, respectively. The proposed algorithm trains the feature extraction network by minimizing the InfoNCE loss function to extract the visual significant features of the target. This method can improve the target recognition accuracy in the absence of data labels.
(2) This paper adds the label information of the current image to the static correlation matrix. The proposed algorithm constructs the dynamic correlation matrix to represent the spatial semantic relationships of the targets. This matrix can extract dynamic spatial semantic features to compensate for the lack of significant features caused by target distortion and occlusion.
(3) The proposed algorithm fuses the significant features and spatial semantic features of the target and trains the target recognition model through cross-entropy loss. The experimental results show that the method effectively solves the problem of low recognition accuracy when the target is distorted and obscured.
The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the visual significant feature extraction model, the spatial semantic feature extraction model, and the underwater distortion target recognition algorithm via enhanced image features. Section 4 verifies the effectiveness of the methods in this paper through simulation experiments. Section 5 concludes the paper.

Related Work
Absorption and scattering of light make underwater image acquisition difficult, and large annotated underwater datasets are expensive to produce. In the absence of data labels, an AUV has difficulty identifying targets that are similar to their environment. Unsupervised representational learning can extract distinguishable features of images using unlabeled data. Wang et al. [2] proposed an adversarial correlated autoencoder (AdvCAE) for unsupervised multiview representation learning. This method eliminates the differences in data from multiple views due to different distributions. Also, Han et al. [3] proposed a semisupervised multiview manifold discriminant intact space (SM2DIS) learning method for image classification. This method learns the complete feature representation from multiview data. Le-Khac et al. [4] summarized the existing literature on contrastive learning and proposed a generalized framework for contrastive representation learning. The framework simplifies and unifies many different contrastive learning algorithms and addresses the application of the contrastive learning framework to the field of computer vision. Chen et al. [5] extended the existing contrastive learning algorithm by embedding an attention mechanism and proposed an attention-augmented contrastive (A2C) learning method. The method can improve the learning efficiency and generalization ability of the algorithm. Li et al. [6] proposed an intermediate-level feature representation framework for unsupervised representation learning via sparse autoencoders. Experimental results show that the method reduces the number of parameters for unsupervised representation learning.
The complex underwater environment and light refraction make it difficult for the AUV to extract the complete significant features of the target. Images contain rich spatial semantic relationships. Su et al. [7] proposed a new multigraph embedding discriminative correlation feature learning algorithm. The method captures the intrinsic geometric structure of each view and learns nonlinear correlation features with good recognition ability. Ma et al. [8] proposed a multiscale spatial context-based deep network for semantic edge detection (MSC-SED). The network obtains rich multiscale features while enhancing high-level feature details. Yang et al. [9] proposed combining structured semantic relevance to solve the problem of missing labels in multilabel learning. Zhao et al. [10] designed a multitask framework to jointly handle the weather-cue segmentation task and the weather classification task. This method solves the problem of poor performance of single-label weather classification. Khan et al. [11] proposed a new multilabel deep GCN. The network can extract discriminative features from irregular structures to enhance the classification results. Nauata et al. [12] proposed modeling the complex relationships between labels through a structured inference neural network. The experimental results show that the method improves the applicability and robustness of the algorithm. Chen and Gupta [13] proposed a spatial memory network (SMN) that can model instance-level contexts. This method can improve the target detection accuracy by using the contextual relationships of objects.
To address underwater environmental interference and the real-time requirements of the algorithm, Cai et al. [14] proposed a collaborative multi-AUV target recognition method based on migration reinforcement learning. Zhang et al. [15] proposed a semantic spatial fusion network (SSFNet) to bridge the gap between low-level and high-level features. Moniruzzaman et al. [16] proposed a Faster R-CNN algorithm using the Inception V2 network. This method can improve the average detection accuracy of the algorithm when the difference between the target and the surrounding boundary is small. Wang et al. [17] proposed a multiview visual-semantic representation method for few-labeled visual recognition (MV²S). The method uses the visual and semantic representations of the image to predict the class of the image. To improve the convergence speed of the algorithm, Cai et al. [18] designed an effective outer space acceleration algorithm. Sun and Cai [19] proposed a multi-AUV target recognition method based on GAN-meta learning. The experimental results show that this method can improve the generalization ability of the model. Cai et al. [20] proposed a maneuvering target recognition method based on multiview optical field reconstruction. This method can ignore the effect of shooting angle on target recognition results. Chen et al. [21] proposed a new iterative visual inference framework. The framework effectively improves the target recognition accuracy. To solve the problem of double computation of data, Cai et al. [22] proposed a multiview optical field reconstruction method based on migration reinforcement learning.

Proposed Method
This paper proposes the UDTRNet, which can enhance image features. This method can make up for the lack of visual significant features through the spatial semantic features of the target. Firstly, the proposed algorithm trains the feature extraction network through the InfoNCE loss function to extract the visual significant features of the target. Then, the dynamic correlation matrix is constructed to represent the spatial semantic relationship of the target, and the spatial semantic features of the target are extracted through this matrix. Finally, this paper fuses the significant features and spatial semantic features of the target and trains the target recognition model through cross-entropy loss. The algorithm effectively solves the problem of low accuracy of target recognition under interference such as distortion and occlusion. The overall process of the algorithm is shown in Figure 1.

Visual Significant Feature Extraction Model.
Unsupervised representation learning can ignore some details of the image. This paper trains a significant feature extraction model to extract distinguishable feature representations of images. The training process is shown in Figure 2.
This paper uses ResNet as the network structure of the significant feature extraction model f(·). The last fully connected layer of the network outputs a 128-dimensional feature vector. The representation h of the image is obtained by normalizing the feature vector, which is expressed as h = f(X). Then, the representation vector of the image is nonlinearly projected into the vector z through the fully connected layer g(·). This method can amplify invariant features and enhance the ability of the network to recognize targets in different views. This paper trains the encoding network f(·) by minimizing the loss function.
In this paper, N original images are randomly enhanced (augmented). The images in the real underwater scene are characterized by blurring, distortion, and incompleteness. This paper uses random cropping, random color distortion, and random Gaussian blur to obtain enhanced samples of the original image. The number of negative samples is an important factor affecting representation learning. This paper constructs a feature library to store all the enhanced samples in the training process. For an image input to the feature extraction network, the feature library contains positive samples k+ that come from the same image as the input sample and negative samples k− that come from different images. The significant feature extraction network maximizes the consistency among different enhanced views of the same image and minimizes the consistency among enhanced views of different images. In this way, the method learns distinguishable representations of images.
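The feature library described above can be illustrated with a minimal sketch. It assumes a fixed-capacity first-in-first-out store and that every stored feature is tagged with the id of its source image; the class name `FeatureLibrary`, its methods, and the capacity are hypothetical and are not taken from the paper.

```python
from collections import deque


class FeatureLibrary:
    """Fixed-capacity FIFO store of encoded enhanced samples (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._items = deque(maxlen=capacity)

    def enqueue(self, image_id, feature):
        """Store one encoded view together with the id of its source image."""
        self._items.append((image_id, feature))

    def split(self, query_image_id):
        """Return (positives, negatives) relative to the query's source image."""
        positives = [f for i, f in self._items if i == query_image_id]
        negatives = [f for i, f in self._items if i != query_image_id]
        return positives, negatives

    def __len__(self):
        return len(self._items)


# Toy usage: five encoded views pushed into a library that holds four.
library = FeatureLibrary(capacity=4)
for image_id, feature in [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (0, [0.9, 0.1]),
                          (2, [0.5, 0.5]), (3, [0.2, 0.8])]:
    library.enqueue(image_id, feature)

positives, negatives = library.split(query_image_id=0)
```

With a bounded library, the oldest view of image 0 has already been evicted, so only one positive sample remains while the three other entries serve as negatives.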
This paper designs a loss function so that the representation of the input image is similar to the positive samples and dissimilar to the negative samples. The similarity of the images is expressed by the cosine similarity of the feature vectors, which is calculated as follows:
$$\mathrm{sim}(z_q, z_k) = \frac{z_q^{\top} z_k}{\lVert z_q \rVert \, \lVert z_k \rVert},$$
where $z_q = g(h_q)$ denotes the nonlinear projection of the representation vector of the input image and $z_k$ denotes the nonlinear projection of positive or negative sample representations in the feature library. The loss function of the significant feature extraction model is given by
$$L_q = -\log \frac{\exp\left(\mathrm{sim}(q, k^{+})/\tau\right)}{\exp\left(\mathrm{sim}(q, k^{+})/\tau\right) + \sum_{k^{-}} \exp\left(\mathrm{sim}(q, k^{-})/\tau\right)},$$
where $q$ is the feature representation of the input image, $k^{+}$ is the feature representation of positive samples, $k^{-}$ is the feature representation of negative samples, and $\tau$ is a temperature parameter that scales the similarity metric of the image representations. The feature library makes a larger number of negative samples available and improves the training effect. However, a large feature library also makes it more difficult to update the feature library encoder $f_k$. This paper dynamically updates the feature library encoder $f_k$ by the encoder $f_q$ of the input samples.
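The contrastive loss described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard InfoNCE form with one positive sample per query; the function names and the temperature value `tau=0.07` are assumptions for illustration and are not specified in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(z_q, z_pos, z_negs, tau=0.07):
    """InfoNCE loss for one query, one positive and a list of negatives."""
    # The positive logit goes first, followed by one logit per negative.
    logits = [cosine_similarity(z_q, z_pos) / tau]
    logits += [cosine_similarity(z_q, z_n) / tau for z_n in z_negs]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # -log(softmax probability assigned to the positive sample).
    return log_sum_exp - logits[0]


# Toy projections: the loss is small when the positive is aligned with the query.
z_q = [1.0, 0.0]
aligned = info_nce_loss(z_q, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]])
misaligned = info_nce_loss(z_q, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])
```

Minimizing this quantity pulls the query toward its positive sample and pushes it away from the negatives, which is the behavior the loss is designed to produce.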
The parameters of the encoders $f_q$ and $f_k$ are denoted as $\theta_q$ and $\theta_k$, respectively. $\theta_k$ is updated as follows:
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$
where the momentum coefficient $m \in [0, 1)$. During the training process, $\theta_q$ is updated by stochastic gradient descent. Whenever $\theta_q$ is updated, $\theta_k$ is updated according to the above rule. After training is complete, the encoder $f_q$ can extract the significant features of the image. The significant features of the image are as follows:
$$f = f_q(x),$$
where $x$ is the input test image and $f_q$ is the trained encoder.
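The momentum update of the feature library encoder can be sketched as follows. The sketch assumes the commonly used exponential-moving-average form, in which the new value of θ_k is m·θ_k + (1 − m)·θ_q, and treats the parameters as flat lists of floats; the function name and the value m = 0.9 are illustrative assumptions.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Move the key-encoder parameters a small step toward the query encoder."""
    if not 0.0 <= m < 1.0:
        raise ValueError("m must lie in [0, 1)")
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]


# Toy usage: theta_q is held fixed here; in training it changes at every SGD step.
theta_q = [1.0, -2.0, 0.5]
theta_k = [0.0, 0.0, 0.0]
for _ in range(3):
    theta_k = momentum_update(theta_k, theta_q, m=0.9)
```

A coefficient close to 1 makes θ_k change slowly, so features already stored in the library stay comparable with newly encoded ones.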

Spatial Semantic Feature Extraction Model.
The target in the underwater image is distorted. The algorithm is therefore unable to extract the complete significant features of the target. This phenomenon can reduce the accuracy of target recognition. This paper extracts the spatial semantic features among nodes by traversing the edges and updating the nodes in the graph. The spatial semantic feature extraction model is shown in Figure 3.
This paper constructs the spatial semantic relation graph $G = \{V, E\}$ for the target, where $V$ is the set of nodes and $E$ is the edge set. Each node indicates a category of the target. The edges represent the spatial semantic relationships among different targets. Assume that the dataset includes $C$ target categories. The set of nodes $V$ can be represented as $\{v_0, v_1, \ldots, v_{C-1}\}$. The element $v_c$ indicates the category $c$. The edge set $E$ is the correlation matrix that represents the correlation among different targets.
Computational Intelligence and Neuroscience
However, the static correlation matrix mainly explains the co-occurrence of labels in the training dataset. The correlation matrix of each input image is fixed. This matrix does not explicitly utilize the content of each input image. This paper constructs the local correlation matrix $B$ for each specific input image. The global correlation matrix and the local correlation matrix are fused as the overall correlation matrix. The result is as follows:
$$A = \omega_E E + \omega_B B,$$
where $\omega_E$ and $\omega_B$ denote the weights. The element $a_{cc'}$ of $A$ denotes the probability of having both target $c'$ and target $c$ in the image, i.e., the correlation between target $c'$ and target $c$. This paper uses the labels of the training set to calculate the correlation between different category pairs in the input images.
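The fusion of the global and local correlation matrices can be illustrated with a small sketch. It makes several assumptions that the text does not fix: the global matrix is estimated as a conditional co-occurrence frequency over the training labels, the local matrix is a binary indicator of which category pairs appear in the current image, and both weights are set to 0.5. All function names are hypothetical.

```python
def cooccurrence_matrix(label_sets, num_classes):
    """Estimate E[c][c2] as the fraction of images containing c that also contain c2."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    occurrences = [0] * num_classes
    for labels in label_sets:
        for c in labels:
            occurrences[c] += 1
            for c2 in labels:
                if c2 != c:
                    counts[c][c2] += 1
    return [[counts[c][c2] / occurrences[c] if occurrences[c] else 0.0
             for c2 in range(num_classes)] for c in range(num_classes)]


def local_matrix(image_labels, num_classes):
    """Binary matrix marking category pairs that co-occur in one image (assumed form)."""
    present = set(image_labels)
    return [[1.0 if (c in present and c2 in present and c != c2) else 0.0
             for c2 in range(num_classes)] for c in range(num_classes)]


def fuse(E, B, w_E=0.5, w_B=0.5):
    """Weighted sum of the global matrix E and the local matrix B."""
    return [[w_E * e + w_B * b for e, b in zip(row_e, row_b)]
            for row_e, row_b in zip(E, B)]


# Toy labels for three categories (0, 1, 2); the current image contains 0 and 2.
train_labels = [{0, 1}, {0, 1}, {0, 2}, {1}]
E = cooccurrence_matrix(train_labels, num_classes=3)
B = local_matrix({0, 2}, num_classes=3)
A = fuse(E, B)
```

In this toy case the pair (0, 2) is rare in the training labels but present in the current image, so the fused entry is larger than the static one, which is the effect the dynamic matrix is meant to have.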
The spatial semantic relationship of the target is learned through the spatial semantic relationship graph. Each node $v_c$ has a correlation $h_c^t$ at time step $t$. This parameter indicates the degree of correlation between the node and the other nodes. In this paper, each node corresponds to a specific target category. The spatial semantic feature extraction model aims to learn the spatial semantic relationship among the targets. The model encourages the dissemination of information among highly correlated nodes.
This paper learns spatial semantic relations through information transfer in graphs. The proposed algorithm updates the spatial semantic relations of the target by aggregating the feature vector $a_c^t$. In the iterative update, $\sigma(\cdot)$ is the logistic sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\odot$ denotes element-wise multiplication. The target node aggregates the information of surrounding nodes to achieve the interaction between the feature vectors corresponding to different nodes. The iterative process is repeated $T$ times. After $T$ iterations, the state $h_c^T$ of each node gives the obtained spatial semantic relation.
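The text names a sigmoid, a hyperbolic tangent, and an element-wise product, which is consistent with a gated, GRU-style node update. The sketch below shows one such update under that assumption; it is not confirmed to be the exact form used in this paper. Scalar weights in the dictionary `w` stand in for learned weight matrices, and all names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate(A, h, w, steps):
    """Run `steps` gated updates over the node states h (one vector per node)."""
    num_nodes, dim = len(h), len(h[0])
    for _ in range(steps):
        # a[c]: correlation-weighted sum of the states of the other nodes.
        a = [[sum(A[c][c2] * h[c2][d] for c2 in range(num_nodes))
              for d in range(dim)] for c in range(num_nodes)]
        new_h = []
        for c in range(num_nodes):
            # Update gate z and reset gate r, computed per dimension.
            z = [sigmoid(w["wz"] * a[c][d] + w["uz"] * h[c][d]) for d in range(dim)]
            r = [sigmoid(w["wr"] * a[c][d] + w["ur"] * h[c][d]) for d in range(dim)]
            # Candidate state uses the reset-gated previous state.
            cand = [math.tanh(w["w"] * a[c][d] + w["u"] * r[d] * h[c][d])
                    for d in range(dim)]
            # Blend the previous state and the candidate with the update gate.
            new_h.append([(1.0 - z[d]) * h[c][d] + z[d] * cand[d]
                          for d in range(dim)])
        h = new_h
    return h


# Toy graph: nodes 0 and 1 are correlated, node 2 is isolated.
A = [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w = {"wz": 1.0, "uz": 1.0, "wr": 1.0, "ur": 1.0, "w": 1.0, "u": 1.0}
h_T = propagate(A, h0, w, steps=3)
```

After a few steps, node 0 carries information that originated at node 1, while the isolated node stays at zero, which illustrates how information spreads only between correlated nodes.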

Underwater Distortion Target Recognition Method via Enhanced Image Features.
This section extracts the candidate regions of the target on the visual significant feature map of the image. The proposed algorithm fuses visual significant features and spatial semantic features to accomplish target recognition. This paper obtains the target anchor boxes by sliding a window over the significant feature $f$. The window size is 3 × 3. The algorithm predicts multiple target anchor boxes simultaneously in each window. The maximum number of anchor boxes per position is denoted as $k$. Each anchor box is mapped to a low-dimensional feature. The features are input to the classification (cls) layer and the regression (reg) layer. The reg layer outputs the vertex coordinates of the $k$ anchor boxes. The cls layer outputs the label and confidence level of each anchor box. For a feature map of size $W \times H$, the proposed method generates $k \times W \times H$ target anchor boxes. This paper measures the prediction accuracy of the model by intersection over union (IoU). The model assigns a binary label to each target candidate box. Candidate boxes with IoU greater than 0.7 receive positive labels. Candidate boxes with IoU less than 0.3 receive negative labels. If there is no anchor box with IoU greater than 0.7, the algorithm selects the candidate box with the largest IoU as the positive label. In addition, candidate boxes that are neither positive nor negative and candidate boxes that cross the image boundary are of no value for training the model. This paper discards them to save computation time.
This paper considers anchor boxes as nodes in the semantic relationship graph. The proposed method fuses the significant features $f_c$ of the nodes and the spatial semantic features $h_c^T$ to predict the target types of the nodes. The fused features are represented as
$$P_c = F_P(f_c, h_c^T),$$
where $F_P$ is a feature fusion output function. This function maps $f_c$ and $h_c^T$ to the feature vector $P_c$. The feature vector $P_c$ includes the significant features and spatial semantic information of the target. This paper feeds this feature vector into a fully connected classification layer to predict the target category score. The cls layer of the model is used for object classification and outputs the discrete probability distribution for each anchor box. The cls layer outputs a $(C + 1)$-dimensional array $S$. This array represents the probabilities that the object belongs to each of the $C$ categories or to the background. The array $S$ is usually calculated by the fully connected layer using the softmax function.
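The anchor labeling rule described above (positive above 0.7 IoU, negative below 0.3, fallback to the best anchor, and removal of boundary-crossing boxes) can be sketched as follows. The sketch assumes a single ground-truth box and boxes given as (x1, y1, x2, y2); the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_anchors(anchors, gt_box, image_size, pos_thr=0.7, neg_thr=0.3):
    """Return {anchor_index: 1 or 0}; anchors missing from the dict are discarded."""
    width, height = image_size
    # Drop anchors that cross the image boundary.
    inside = [i for i, (x1, y1, x2, y2) in enumerate(anchors)
              if x1 >= 0 and y1 >= 0 and x2 <= width and y2 <= height]
    scores = {i: iou(anchors[i], gt_box) for i in inside}
    labels = {}
    for i, s in scores.items():
        if s > pos_thr:
            labels[i] = 1
        elif s < neg_thr:
            labels[i] = 0
    # Fallback: if no anchor exceeds the positive threshold, keep the best one.
    if scores and 1 not in labels.values():
        best = max(scores, key=scores.get)
        labels[best] = 1
    return labels


gt = (10, 10, 50, 50)
anchors = [(10, 10, 50, 50),   # identical to the ground truth
           (30, 30, 70, 70),   # small overlap
           (12, 12, 55, 55),   # large overlap
           (15, 15, 60, 60),   # intermediate overlap, discarded
           (-5, -5, 40, 40)]   # crosses the image boundary, discarded
labels = label_anchors(anchors, gt, image_size=(100, 100))
fallback = label_anchors([(15, 15, 60, 60)], gt, image_size=(100, 100))
```

Only labeled anchors take part in training; the intermediate-overlap anchor is discarded in the first call but becomes the positive sample in the second call, where it is the best available candidate.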
This paper trains the model by minimizing the loss function. The loss function consists of two components: classification loss and regression loss. In the loss function, $i$ denotes the index of an anchor box and $c$ is the target category. $s_i^c$ denotes the predicted probability of the target type in anchor box $i$, and $s_i^*$ is the real label of anchor box $i$. $T_i$ denotes the coordinates of the four vertices of the target anchor box, and $T_i^*$ denotes the vertex coordinates of the real target region. $R$ is the smooth L1 function. $s_i^c$ and $T_i$ are given by the classification and regression layers. $N_{cls}$ and $N_{reg}$ are the normalization terms of the loss function. $N_{cls}$ is numerically equal to the mini-batch size for training. $N_{reg}$ is equal to the number of target anchor boxes. $\lambda$ is the weight. $\sigma(\cdot)$ is the sigmoid function.
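A two-part detection loss of this kind can be sketched as below. The sketch assumes a Faster R-CNN-style form: softmax cross-entropy for classification plus a smooth L1 regression term that is counted only for non-background anchors. For simplicity both normalization terms are set to the number of anchors passed in, which differs from the definitions of N_cls and N_reg in the text, and the role of the sigmoid is not modeled. All names and the value λ = 1 are illustrative assumptions.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def detection_loss(class_scores, true_classes, boxes, true_boxes,
                   lam=1.0, background=0):
    """Classification loss plus weighted regression loss over a set of anchors."""
    n_cls = len(class_scores)  # assumed normalizer for the classification term
    n_reg = len(boxes)         # assumed normalizer for the regression term
    cls_loss = 0.0
    reg_loss = 0.0
    for scores, c, t, t_star in zip(class_scores, true_classes, boxes, true_boxes):
        probs = softmax(scores)
        cls_loss += -math.log(probs[c])
        # Regression is only meaningful for anchors that contain a target.
        if c != background:
            reg_loss += sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls_loss / n_cls + lam * reg_loss / n_reg


# Toy batch: one target anchor (class 1) and one background anchor (class 0).
loss = detection_loss(
    class_scores=[[0.0, 5.0], [5.0, 0.0]],
    true_classes=[1, 0],
    boxes=[[10.0, 10.0, 50.0, 50.0], [0.0, 0.0, 20.0, 20.0]],
    true_boxes=[[10.5, 10.0, 50.0, 52.0], [0.0, 0.0, 0.0, 0.0]],
)
```

The background anchor contributes only to the classification term, so its placeholder box coordinates do not affect the result.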

Experimental Results and Analysis
In this experiment, training and testing are performed in TensorFlow. The simulation runs on a small server (RTX 2080Ti GPU, 64 GB of RAM, and 64-bit Windows 10 operating system).

Experimental Dataset.
In this paper, three datasets, the Cognitive Autonomous Diving Buddy (CADDY) underwater dataset, the Underwater Image Enhancement Benchmark (UIEB), and the Underwater Target dataset (UTD), are used for training and testing. The visual significant feature extraction model is trained on 13,000 unlabeled images. In addition, 426 labeled images are used to train and test the spatial semantic feature extraction model and the target recognition network. The dataset is divided into a training set and a test set at a ratio of 6.5 : 3.5.

Implementation Details.
The model is trained by the stochastic gradient descent (SGD) optimizer with a weight decay of 0.0005 and a momentum of 0.9. The training batch size is 256 and the initial learning rate is 0.01. The whole training process runs for 70,000 iterations, in which the learning rate decays at 56,000 and 63,000 iterations with a decay rate of 0.1.
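The learning rate schedule stated above can be written as a short step-decay function. The sketch assumes that the decay takes effect from the milestone iteration onward; the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.01, milestones=(56000, 63000), gamma=0.1):
    """Step decay: multiply by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


# Learning rate at a few iterations of the 70,000-iteration run.
schedule = {it: learning_rate(it) for it in (0, 55999, 56000, 63000, 69999)}
```

Under this assumption the learning rate is 0.01 for the first 56,000 iterations, 0.001 until iteration 63,000, and 0.0001 for the remainder of training.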

Experimental Results.
The model proposed in this paper fuses visual significant features and spatial semantic features for target recognition. This method can solve the problem of low recognition accuracy under interference such as target distortion and blurring. For underwater images with different disturbances, this section designs three sets of simulation experiments to verify the effectiveness of the proposed algorithm. The algorithm evaluation criteria are mAP and recognition time.
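For reference, mAP can be computed as the mean of per-class average precision. The sketch below assumes the all-point interpolated definition of average precision and detections that have already been matched to ground truth; the paper does not state which variant it uses. The inputs are made-up toy values, not experimental results, and the function names are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) for one class."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    precisions, recalls = [], []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += 1 if is_tp else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)
    # Interpolation: replace each precision by the best precision at any higher rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: {class_name: (detections, num_ground_truth)}."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)


# Toy detections for two classes (illustrative values only).
toy = {
    "torpedo": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "frogman": ([(0.95, True), (0.6, True)], 2),
}
toy_map = mean_average_precision(toy)
```

A false positive ranked between two true positives lowers the precision at higher recall, which is why the first toy class scores below 1 while the second does not.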

Conventional Underwater Image Recognition Results.
This section evaluates the recognition performance of the proposed algorithm on conventional underwater images and compares it with FFBNet [23], SiamFPN [24], SA-FPN [25], and Faster R-CNN [26]. The recognition results are shown in Figure 4, and the recognition accuracy and recognition speed are shown in Table 1.
As can be seen from Table 1, FFBNet takes 0.09 s to recognize an underwater image, which is the fastest recognition speed among all the compared algorithms. However, the mAP of the algorithm is only 0.7132. On the contrary, Faster R-CNN has the highest mAP and the lowest recognition speed, 0.7466 and 0.397 s, respectively. SiamFPN and SA-FPN greatly reduce the recognition time at the expense of partial accuracy. The proposed algorithm in this paper can better balance the recognition speed and accuracy. In the recognition of conventional underwater images, the overall performance of the proposed algorithm is better than that of FFBNet and Faster R-CNN. At the same time, compared with SiamFPN and SA-FPN, the proposed algorithm has lower recognition time and higher recognition accuracy.

Underwater Blurred Image Recognition Results.
This section evaluates the performance of the proposed algorithm in recognizing underwater blurred images and compares it with FFBNet [23], SiamFPN [24], SA-FPN [25], and Faster R-CNN [26]. The recognition results are shown in Figure 5. The recognition accuracy and recognition speed of each algorithm are shown in Table 2.
As can be seen in Table 2, the accuracy of each algorithm in recognizing underwater blurred images has decreased. The mAP of the algorithm in this paper is 0.6652, which is ahead of the other comparison algorithms. Compared with the state-of-the-art target recognition algorithm SA-FPN, the mAP of the proposed algorithm is improved by 1.52%. Moreover, the algorithm in this paper has a clear lead in identifying torpedo, frogman, and submarine targets. The reason is that the method in this paper can enhance the target features through spatial semantic relations.

Underwater Distortion Image Recognition Results.
This section evaluates the performance of the proposed algorithm in recognizing underwater distorted images and compares it with FFBNet [23], SiamFPN [24], SA-FPN [25], and Faster R-CNN [26]. The recognition results are shown in Figure 6. The recognition accuracy and recognition speed of each algorithm are shown in Table 3.
As can be seen from Table 3, the SiamFPN algorithm has the best recognition effect on underwater distorted images. The recognition accuracy is 0.6652, and the recognition speed is 0.225 s. Though the average recognition accuracy of the algorithm in this paper is 1.82% lower than that of SiamFPN, the algorithm has faster recognition speed. This paper also analyzes the recognition results of single-type targets. The algorithm in this paper is more effective in identifying distorted torpedo and frogman targets.

Conclusion
Under heavy underwater interference, it is difficult for AUVs to extract the complete significant features of the target. This paper uses spatial semantic features to make up for the lack of distinctive visual features. Firstly, this paper extracts the significant features of the image by minimizing the InfoNCE loss. Secondly, this paper constructs the dynamic correlation matrix to capture the spatial semantic relationship of the target and uses the matrix to extract spatial semantic features. Finally, this paper fuses the significant features and spatial semantic features of the target and then trains the target recognition model through cross-entropy loss. In the recognition of underwater conventional images and distorted images, the comprehensive performance of the algorithm in this paper is better than that of existing algorithms. When recognizing underwater blurred images, the mAP of the algorithm in this paper is improved by 1.52% compared with the existing algorithm.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.